In [36]:
'''
- Importing libraries
- Creating data sets
- Creating data frames
- Reading from CSV
- Exporting to CSV
- Finding maximums
- Plotting data

Create Data - We begin by creating our own data set for analysis.
This saves you from having to download any files to replicate the results below.
We will export this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file. The data consist of baby names and the number of babies born with each name in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I mean we will look inside the contents of the text file for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found, we will then have to decide what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user which name is the most popular in a specific year.
'''
# Enable inline plotting
%matplotlib inline
# General syntax to import specific functions in a library:
##from (library) import (specific library function)
from pandas import DataFrame, read_csv
# General syntax to import a library but no functions:
##import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number
import matplotlib #only needed to determine Matplotlib version number
In [37]:
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)
Create Data - The data set will consist of 5 baby names and the number of births recorded for that year (1880).
In [38]:
'''
Create Data
'''
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
In [43]:
BabyDataSet = list(zip(names,births))
BabyDataSet
Out[43]:
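As a minimal sketch of what the zip call above produces: zip walks both lists in step and yields one (name, births) tuple per position, and list() collects those tuples. The variable name pairs is only illustrative and is not part of the tutorial.

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# zip pairs the i-th name with the i-th birth count;
# list() materializes the pairs so they can be inspected
pairs = list(zip(names, births))
print(pairs)
# [('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
```

Each tuple later becomes one row of the data frame, with the first element going to the Names column and the second to the Births column.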
In [45]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df
Out[45]:
In [50]:
'''
Get Data
'''
df.to_csv?
In [51]:
df.to_csv('births1880.csv',index=False,header=False)
In [52]:
read_csv?
In [56]:
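# Path to the file written by to_csv above. This absolute path is machine-specific,
# so point it at your own working directory (or simply use 'births1880.csv').
Location = r'C:\Users\cr\Documents\UCM 4\MD\teamMin\tutorial_pandas\births1880.csv'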
df = pd.read_csv(Location, header=None, names=['Names','Births'])
df
Out[56]:
In [57]:
import os
# Delete the CSV file now that its contents are loaded into df
os.remove(Location)
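The export, read, and delete steps above can also be sketched without a machine-specific path. The version below is an illustration, not the tutorial's own method: it assumes the standard-library tempfile module is acceptable, rebuilds the same five-row data frame so it runs on its own, and uses the hypothetical names tmp_dir, location, and round_trip.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]})

# Work inside a temporary directory so the file is cleaned up automatically
with tempfile.TemporaryDirectory() as tmp_dir:
    location = os.path.join(tmp_dir, 'births1880.csv')
    # Write the rows only: no index column and no header line
    df.to_csv(location, index=False, header=False)
    # The file has no header, so supply the column names while reading
    round_trip = pd.read_csv(location, header=None, names=['Names', 'Births'])

print(round_trip)
```

Because the file is written with header=False, reading it back without header=None would treat the first record (Bob, 968) as column names and silently drop it from the data.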
In [58]:
'''
Prepare Data
'''
# Check data type of the columns
df.dtypes
Out[58]:
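Checking dtypes is only one part of making sure the data is clean. A few further checks for the kinds of anomalies mentioned in the introduction (missing data, inconsistencies) might look like the sketch below. Which checks matter is an assumption on my part, and the names missing, duplicate_names, and negative_births are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]})

# Number of missing values in each column
missing = df.isna().sum()
# Number of names that repeat an earlier row
duplicate_names = df['Names'].duplicated().sum()
# A birth count should never be negative
negative_births = (df['Births'] < 0).sum()

print(missing)
print('Duplicate names:', duplicate_names)
print('Negative birth counts:', negative_births)
```

For this small hand-built data set every check comes back as zero, which is what we expect from data we created ourselves.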
In [59]:
'''
Analyze Data
'''
# Check data type of Births column
df.Births.dtype
Out[59]:
In [62]:
# Method 1: sort by Births in descending order and take the top row
Sorted = df.sort_values(['Births'], ascending=False)
Sorted.head(1)
Out[62]:
In [63]:
# Method 2: take the maximum of the Births column directly
df['Births'].max()
Out[63]:
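Method 2 returns only the number, not the name that goes with it. One possible third approach, not part of the original tutorial, is idxmax, which returns the index label of the row holding the maximum so that the whole row can be looked up with loc. The name top_row is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]})

# idxmax gives the index label of the first row with the largest Births value
top_row = df.loc[df['Births'].idxmax()]
print(top_row['Names'], top_row['Births'])
# Mel 973
```

If several names tie for the maximum, idxmax reports only the first one, whereas the boolean filter df[df['Births'] == df['Births'].max()] keeps all of them.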
In [88]:
'''
Present Data
'''
# Create graph
df['Births'].plot()
# Maximum value in the data set
MaxValue = df['Births'].max()
# Name associated with the maximum value (first match if there are ties);
# [0] extracts a plain string instead of a one-element array
MaxName = df['Names'][df['Births'] == MaxValue].values[0]
# Text to display on graph
Text = str(MaxValue) + " - " + MaxName
# Add text to graph
plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
xycoords=('axes fraction', 'data'), textcoords='offset points')
print("The most popular name")
df[df['Births'] == df['Births'].max()]
Out[88]:
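The line plot above puts the row index (0 to 4) on the x axis, so the names themselves do not appear on the graph. As an alternative sketch, a bar chart keyed by name may be easier to read. This is a suggestion rather than the tutorial's method. The Agg backend is an assumption so that the code runs without a display (in a notebook, %matplotlib inline would be used instead), and the names ax, top_pos, and label are illustrative.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]})

# One bar per name, with the names as x-axis tick labels
ax = df.plot(kind='bar', x='Names', y='Births', legend=False)
ax.set_ylabel('Births')

# With the default 0..n-1 index, the idxmax label is also the bar position
top_pos = df['Births'].idxmax()
label = str(df['Births'].max()) + " - " + df['Names'][top_pos]

# Place the label just above the tallest bar
ax.annotate(label, xy=(top_pos, df['Births'].max()),
            xytext=(0, 5), textcoords='offset points', ha='center')
plt.tight_layout()
```

With the bars labeled by name, the annotation sits directly above the most popular name instead of at the right-hand edge of the axes.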